Introduction

This document contains descriptive statistics and statistical analyses of data from a norming study.

Background

We are interested in addressing several questions about ambiguity in the mental lexicon:

  1. Does the contextual distance between two usages of a wordform impact the ease with which a comprehender transitions between those contexts?
  2. Do word senses "exist" in the mental lexicon? That is, does the mental lexicon organize the vast array of contexts in which a wordform occurs into distinct categories (i.e., senses)?
  3. Does the mental lexicon organize polysemous and homonymous meanings differently?

To this end, we adapted a set of stimuli from previous studies, which will ultimately be used in a primed sensibility judgment task. The central question is whether, and how, the ease of transitioning between two usages of a wordform is impacted by each of the relevant theoretical variables:

  1. The cosine distance between the contextualized representations of that wordform (as measured/obtained by BERT and ELMo).
  2. Whether the two usages cross a sense boundary (as determined by Merriam-Webster/OED).
  3. For different-sense usages, whether the relationship is one of homonymy or polysemy (again, as determined by Merriam-Webster/OED).
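
As a quick reference for the first variable, cosine distance is 1 minus the cosine similarity of the two contextualized vectors. A minimal sketch in base R, using toy vectors rather than actual BERT/ELMo embeddings:

```r
### Cosine distance: 1 - cos(u, v). Toy vectors for illustration only;
### the actual values come from BERT/ELMo contextualized embeddings.
cosine_distance = function(u, v) {
  1 - sum(u * v) / (sqrt(sum(u^2)) * sqrt(sum(v^2)))
}

u = c(0.2, 0.7, 0.1)
v = c(0.3, 0.6, 0.4)
cosine_distance(u, v)  ## identical vectors give 0; orthogonal vectors give 1
```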

Description

Stimuli were adapted from previous work. Each "item" (or word) was used in four possible sentences, corresponding to two distinct senses. The grammatical category of the word was always the same across sentences, even if it had a different meaning (i.e., always a Noun, or always a Verb).

Thus, there are six possible pairwise comparisons for each word.
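
To make the counting concrete, the six unordered pairs can be enumerated with combn() (using the version labels M1_a, M1_b, M2_a, M2_b, where M1/M2 index the sense and a/b the sentence):

```r
versions = c("M1_a", "M1_b", "M2_a", "M2_b")
pairs = combn(versions, 2)
ncol(pairs)  ## 6 unordered pairs; 12 comparisons once presentation order is counted
```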

For example, the word "lamb" might be used in the following sentences:

1a. They liked the marinated lamb. 
1b. They liked the grilled lamb.    
2a. They liked the cute lamb.   
2b. They liked the friendly lamb.

The first two correspond to the food sense of "lamb", and the second two correspond to the animal sense. Of course, these two senses are clearly related. For other wordforms, the senses are less similar, or even entirely unrelated, as is the case for Homonymy:

1a. It was a windy port.    
1b. It was a seaside port.  
2a. It was a delicious port.    
2b. It was a sweet port.

As noted above, Same/Different Sense was determined by consulting Merriam-Webster and the OED. There were several cases in which it was difficult to tell whether two usages were in fact different senses (e.g., glossy magazine and weekly magazine); these were marked Unsure under Ambiguity Type. They were included in this norming study, but will ultimately be excluded from future experiments (as well as the publicly available normed relatedness judgments).

We also hand-annotated the Ambiguity Type for each of the Different Sense usages. If two usages corresponded to different entries, the relation was listed as Homonymy; if two usages were different sub-entries or "senses" under the same entry, the relation was listed as Polysemy. We did not annotate for more fine-grained polysemous relations (e.g., Metaphor vs. Metonymy), but future work could benefit from a more granular analysis.

Purpose of norming study

Before running our primary task, we sought to norm each of these items. This would serve several purposes:

  1. An overall validation of our manipulation: if relatedness judgments do not vary as a function of same/different sense (or of Ambiguity Type), it suggests the central manipulation is not successful.
  2. Identifying potentially problematic stimuli: e.g., if specific words consistently elicit lower-than-average ratings for same sense usages, that suggests those stimuli should be removed or modified.
  3. Assessing the "Unsure" items.
  4. Developing a resource/metric for contextualized language models: while there are a number of similarity judgment datasets (e.g., SimLex), to our knowledge there are none that compare the similarity of the same wordform in two different contexts. This would be useful for assessing the ability of contextualized language models (like BERT) to capture context-specific human judgments about relatedness.

Description of norming study

We recruited 81 subjects total from the SONA undergraduate pool at UC San Diego. Each participant saw a series of sentence pairs with the target word bolded, and was asked to determine how related the usages of that target word were across sentences. They were given five labeled options, ranging from "totally unrelated" to "same meaning".

There were 1380 possible sentence pairs: 115 words, each with 12 possible comparisons (6 unordered pairs of the 4 versions, times 2 presentation orders). Each subject saw only 115 critical trials---1 comparison per word (i.e., no subject saw multiple comparisons with the same word). The comparison any given subject saw for a given word was randomly sampled from the 12 possibilities, and trial order was also randomized.
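
The trial counts above can be verified with a quick sketch (the sampling step is illustrative only, not the actual experiment script):

```r
n_words = 115
n_versions = 4
n_ordered = choose(n_versions, 2) * 2  ## 6 unordered pairs x 2 orders = 12
n_words * n_ordered                    ## 1380 possible sentence pairs

### Illustrative per-subject assignment: one random comparison per word
set.seed(42)  ## hypothetical seed
assigned = sample(1:n_ordered, n_words, replace = TRUE)
length(assigned)  ## 115 critical trials per subject
```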

Finally, we included several comprehension checks to ensure attentiveness. First, we included bot checks at the beginning of the study; participants had to answer questions like "Which of the following is not a place to swim" (the correct answer is "Chair"). We also included two "catch" trials in the body of the main study. In one case, the word "house" was used in exactly the same sentence, meaning that the correct answer would be "same meaning"; in the other, the word "rose" was used in a completely different grammatical context ("red rose" vs. "rose from the chair"), meaning that the correct answer should be "totally unrelated".

Load data

First, we load the processed norming data; responses from individual subjects have already been collapsed into a single .csv file.

### Set working directory (comment this out to run)
# setwd("/Users/seantrott/Dropbox/UCSD/Research/Ambiguity/SSD/raw-c/src/analysis")

### Load preprocessed data
df_normed = read_csv("../../data/processed/polysemy_norming.csv")
## Warning: Missing column names filled in: 'X1' [1]
## Parsed with column specification:
## cols(
##   .default = col_character(),
##   X1 = col_double(),
##   rt = col_double(),
##   trial_index = col_double(),
##   time_elapsed = col_double(),
##   same = col_logical(),
##   item = col_double(),
##   relatedness = col_double(),
##   Age = col_double(),
##   B1 = col_double()
## )
## See spec(...) for full column specifications.
### Filter to critical trials
df_normed_critical = df_normed %>%
  filter(same %in% c(TRUE, FALSE)) %>%
  filter(version_with_order != "catch")

length(unique(df_normed_critical$subject))
## [1] 81
length(unique(df_normed_critical$word))
## [1] 115
nrow(df_normed_critical)
## [1] 9315
table(df_normed_critical$same, df_normed_critical$ambiguity_type_oed)
##        
##         Homonymy Polysemy Unsure
##   FALSE     2040     3624    470
##   TRUE      1038     1884    259
### Recode version information to omit order, so it can be merged with distance information
df_normed_critical$version = fct_recode(
  df_normed_critical$version_with_order,
  M1_a_M1_b = "M1_b_M1_a",
  M1_b_M2_a = "M2_a_M1_b",
  M1_a_M2_a = "M2_a_M1_a",
  M1_a_M2_b = "M2_b_M1_a",
  M1_b_M2_b = "M2_b_M1_b",
  M2_a_M2_b = "M2_b_M2_a"
)

Preprocessing

Bot checks

We then identify and remove subjects who failed either of the bot checks.

df_ppt_bots = df_normed %>%
  filter(type == "bot_check") %>%
  mutate(b1_correct = B1 == 2,
         b2_correct = B2 == "Chair")

df_bot_summ = df_ppt_bots %>%
  group_by(subject) %>%
  summarise(bot_avg = (b1_correct + b2_correct) / 2)
## `summarise()` ungrouping output (override with `.groups` argument)
df_bot_summ
## # A tibble: 81 x 2
##    subject    bot_avg
##    <chr>        <dbl>
##  1 06x378nef2     1  
##  2 08vvfnkfly     0.5
##  3 0dt2huspth     1  
##  4 0s0lmhdwrr     1  
##  5 1cxtrztfca     1  
##  6 1g7e5qu12p     1  
##  7 1n7bm7nw59     1  
##  8 1y74rcu3kk     1  
##  9 2533snm8m5     1  
## 10 25wf33vawc     1  
## # … with 71 more rows
## Now remove ppts from critical stims that have < 100% average
df_normed_critical = df_normed_critical %>%
  left_join(df_bot_summ, by = "subject") %>%
  filter(bot_avg == 1)
length(unique(df_normed_critical$subject))
## [1] 80

Analyze catch trials

We also remove subjects who scored below 50% on the catch trials.

### "Rose" should be "totally unrelated", and "house" should be "same meaning"

df_catch = df_normed %>%
  filter(same %in% c(TRUE, FALSE)) %>%
  filter(version_with_order == "catch") %>%
  mutate(correct_answer = case_when(
    word == "rose" ~ 0,  ## Strict (totally unrelated)
    word == "house" ~ 4  ## Strict (same meaning)
  )) %>%
  mutate(correct_response = relatedness == correct_answer)


### 
df_ppts_catch = df_catch %>%
  group_by(subject) %>%
  summarise(catch_avg = mean(correct_response))
## `summarise()` ungrouping output (override with `.groups` argument)
df_ppts_catch
## # A tibble: 81 x 2
##    subject    catch_avg
##    <chr>          <dbl>
##  1 06x378nef2       1  
##  2 08vvfnkfly       1  
##  3 0dt2huspth       1  
##  4 0s0lmhdwrr       1  
##  5 1cxtrztfca       1  
##  6 1g7e5qu12p       1  
##  7 1n7bm7nw59       0  
##  8 1y74rcu3kk       1  
##  9 2533snm8m5       0.5
## 10 25wf33vawc       1  
## # … with 71 more rows
## Now remove ppts from critical stims that have < 100% average
df_normed_critical = df_normed_critical %>%
  left_join(df_ppts_catch, by = "subject") %>%
  filter(catch_avg >= .5) # remove people who got less than 50% on the catch
length(unique(df_normed_critical$subject))
## [1] 77

Demographics statistics

Here, we report general demographic statistics about the participants:

df_demo = df_normed_critical %>%
  group_by(subject, Gender, Mobile_Device, Native_Speaker) %>%
  summarise(age = mean(Age))
## `summarise()` regrouping output by 'subject', 'Gender', 'Mobile_Device' (override with `.groups` argument)
table(df_demo$Mobile_Device) 
## 
##  No Yes 
##  74   3
table(df_demo$Gender) 
## 
##     Female       Male Non-binary 
##         59         16          2
table(df_demo$Native_Speaker) 
## 
##                   No Prefer not to answer                  Yes 
##                    2                    1                   74
mean(df_demo$age)
## [1] 20.22078
sd(df_demo$age)
## [1] 2.702939
median(df_demo$age)
## [1] 20
range(df_demo$age)
## [1] 18 38

Primary analyses

Load modeling data

Here, we load the results from the neural language model analyses and merge them with our norming data.

df_distances = read_csv("../../data/processed/stims_with_nlm_distances.csv")
## Warning: Missing column names filled in: 'X1' [1]
## Parsed with column specification:
## cols(
##   X1 = col_double(),
##   Class = col_character(),
##   ambiguity_type = col_character(),
##   ambiguity_type_mw = col_character(),
##   ambiguity_type_oed = col_character(),
##   different_frame = col_character(),
##   distance_bert = col_double(),
##   distance_elmo = col_double(),
##   overlap = col_character(),
##   same = col_logical(),
##   source = col_character(),
##   string = col_character(),
##   version = col_character(),
##   word = col_character()
## )
nrow(df_distances)
## [1] 690
df_merged = df_normed_critical %>%
  left_join(df_distances, by = c("word", "version", "string", "overlap",
                                 "source", "same", "Class", "ambiguity_type"))

nrow(df_merged)
## [1] 8855
length(unique(df_merged$subject))
## [1] 77

H2: Does relatedness differ as a function of ambiguity type?

Here, we want to know whether pairs categorized as homonymous are judged as less related, on average, than pairs categorized as polysemous.

Of course, if there is an effect of ambiguity_type, we expect it to show up primarily for different sense words. This could be modeled in one of two ways:

  • Using only different sense words, we can ask whether there's a main effect of ambiguity_type.
  • Using all data, we could ask whether there's a significant interaction between ambiguity_type and same sense.

We adopt the latter approach here.

In other words, if ambiguity_type matters, its effect should change as a function of whether a given comparison involves same or different sense usages of a word.

Note that since ambiguity_type is only manipulated across words, this analysis complements the first approach above, which considers only different sense words. It's conceivable that one could observe a main effect of ambiguity_type for different sense words simply because the stimuli chosen as homonyms are less related overall (including their same sense usages). Thus, this analysis asks whether ambiguity_type has a different relationship with relatedness for same sense vs. different sense comparisons.

df_merged %>%
  group_by(same, ambiguity_type) %>%
  summarise(mean_relatedness = mean(relatedness),
            median_relatedness = median(relatedness),
            sd_relatedness = sd(relatedness))
## `summarise()` regrouping output by 'same' (override with `.groups` argument)
## # A tibble: 6 x 5
## # Groups:   same [2]
##   same  ambiguity_type mean_relatedness median_relatedness sd_relatedness
##   <lgl> <chr>                     <dbl>              <dbl>          <dbl>
## 1 FALSE Homonymy                  0.467                  0          0.864
## 2 FALSE Polysemy                  1.75                   2          1.50 
## 3 FALSE Unsure                    3.40                   4          1.02 
## 4 TRUE  Homonymy                  3.22                   4          1.21 
## 5 TRUE  Polysemy                  3.58                   4          0.884
## 6 TRUE  Unsure                    3.81                   4          0.726
model_interaction = lmer(data = df_merged,
                  relatedness ~ same * ambiguity_type + 
                    distance_bert + distance_elmo + 
                    Class +
                    (1 + same + ambiguity_type | subject) +
                    (1 + same | word),
                  control=lmerControl(optimizer="bobyqa"),
                  REML = FALSE)
## boundary (singular) fit: see ?isSingular
model_both = lmer(data = df_merged,
                  relatedness ~ same + ambiguity_type +
                    distance_bert + distance_elmo + 
                    Class +
                    (1 + same+ ambiguity_type | subject) +
                    (1 + same | word),
                  control=lmerControl(optimizer="bobyqa"),
                  REML = FALSE)
## boundary (singular) fit: see ?isSingular
summary(model_interaction)
## Linear mixed model fit by maximum likelihood  ['lmerMod']
## Formula: relatedness ~ same * ambiguity_type + distance_bert + distance_elmo +  
##     Class + (1 + same + ambiguity_type | subject) + (1 + same |      word)
##    Data: df_merged
## Control: lmerControl(optimizer = "bobyqa")
## 
##      AIC      BIC   logLik deviance df.resid 
##  24904.1  25067.2 -12429.1  24858.1     8832 
## 
## Scaled residuals: 
##     Min      1Q  Median      3Q     Max 
## -4.6042 -0.5058  0.0485  0.5419  3.9860 
## 
## Random effects:
##  Groups   Name                   Variance Std.Dev. Corr             
##  word     (Intercept)            0.64525  0.8033                    
##           sameTRUE               0.63725  0.7983   -0.93            
##  subject  (Intercept)            0.07420  0.2724                    
##           sameTRUE               0.14536  0.3813   -0.77            
##           ambiguity_typePolysemy 0.03200  0.1789    0.09 -0.29      
##           ambiguity_typeUnsure   0.07407  0.2722   -0.69  0.94  0.04
##  Residual                        0.87961  0.9379                    
## Number of obs: 8855, groups:  word, 115; subject, 77
## 
## Fixed effects:
##                                 Estimate Std. Error t value
## (Intercept)                      0.89642    0.14769   6.070
## sameTRUE                         2.53393    0.14418  17.575
## ambiguity_typePolysemy           1.27561    0.16392   7.782
## ambiguity_typeUnsure             2.86532    0.48945   5.854
## distance_bert                   -0.58128    0.11709  -4.964
## distance_elmo                   -2.24542    0.47098  -4.768
## ClassV                           0.01607    0.07363   0.218
## sameTRUE:ambiguity_typePolysemy -0.88873    0.16575  -5.362
## sameTRUE:ambiguity_typeUnsure   -2.15173    0.49877  -4.314
## 
## Correlation of Fixed Effects:
##             (Intr) smTRUE ambg_P ambg_U dstnc_b dstnc_l ClassV sTRUE:_P
## sameTRUE    -0.883                                                     
## ambgty_typP -0.717  0.677                                              
## ambgty_typU -0.259  0.251  0.217                                       
## distanc_brt -0.197  0.113  0.000 -0.009                                
## distance_lm -0.287  0.129  0.010  0.025 -0.187                         
## ClassV      -0.079 -0.005 -0.045  0.025  0.062  -0.082                 
## smTRUE:mb_P  0.671 -0.764 -0.904 -0.200 -0.019  -0.012   0.003         
## smTRUE:mb_U  0.237 -0.261 -0.199 -0.909 -0.032  -0.025   0.001  0.221  
## convergence code: 0
## boundary (singular) fit: see ?isSingular
anova(model_interaction, model_both)
## Data: df_merged
## Models:
## model_both: relatedness ~ same + ambiguity_type + distance_bert + distance_elmo + 
## model_both:     Class + (1 + same + ambiguity_type | subject) + (1 + same | 
## model_both:     word)
## model_interaction: relatedness ~ same * ambiguity_type + distance_bert + distance_elmo + 
## model_interaction:     Class + (1 + same + ambiguity_type | subject) + (1 + same | 
## model_interaction:     word)
##                   npar   AIC   BIC logLik deviance  Chisq Df Pr(>Chisq)    
## model_both          21 24934 25083 -12446    24892                         
## model_interaction   23 24904 25067 -12429    24858 33.675  2  4.869e-08 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
df_tidy = broom.mixed::tidy(model_interaction)

df_tidy %>%
  filter(effect == "fixed") %>%
  ggplot(aes(x = term,
             y = estimate)) +
  geom_point() +
  coord_flip() +
  geom_hline(yintercept = 0, linetype = "dotted") +
  geom_errorbar(aes(ymin = estimate - 2*std.error, 
                    ymax = estimate + 2*std.error), 
                width=.2,
                position=position_dodge(.9)) +
  labs(x = "Predictor",
       y = "Estimate") +
  theme_minimal()

H3: Do ELMo/BERT explain independent variance?

model_no_bert = lmer(data = df_merged,
                  relatedness ~ same * ambiguity_type + 
                    distance_elmo + 
                    Class +
                    (1 + same + ambiguity_type | subject) +
                    (1 + same | word),
                  control=lmerControl(optimizer="bobyqa"),
                  REML = FALSE)
## boundary (singular) fit: see ?isSingular
anova(model_interaction, model_no_bert)
## Data: df_merged
## Models:
## model_no_bert: relatedness ~ same * ambiguity_type + distance_elmo + Class + 
## model_no_bert:     (1 + same + ambiguity_type | subject) + (1 + same | word)
## model_interaction: relatedness ~ same * ambiguity_type + distance_bert + distance_elmo + 
## model_interaction:     Class + (1 + same + ambiguity_type | subject) + (1 + same | 
## model_interaction:     word)
##                   npar   AIC   BIC logLik deviance  Chisq Df Pr(>Chisq)    
## model_no_bert       22 24926 25082 -12441    24882                         
## model_interaction   23 24904 25067 -12429    24858 24.404  1  7.809e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
model_no_elmo = lmer(data = df_merged,
                  relatedness ~ same * ambiguity_type + 
                    distance_bert + 
                    Class +
                    (1 + same + ambiguity_type | subject) +
                    (1 + same | word),
                  control=lmerControl(optimizer="bobyqa"),
                  REML = FALSE)
## boundary (singular) fit: see ?isSingular
anova(model_interaction, model_no_elmo)
## Data: df_merged
## Models:
## model_no_elmo: relatedness ~ same * ambiguity_type + distance_bert + Class + 
## model_no_elmo:     (1 + same + ambiguity_type | subject) + (1 + same | word)
## model_interaction: relatedness ~ same * ambiguity_type + distance_bert + distance_elmo + 
## model_interaction:     Class + (1 + same + ambiguity_type | subject) + (1 + same | 
## model_interaction:     word)
##                   npar   AIC   BIC logLik deviance  Chisq Df Pr(>Chisq)    
## model_no_elmo       22 24924 25080 -12440    24880                         
## model_interaction   23 24904 25067 -12429    24858 22.386  1   2.23e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Discussion

It appears that Ambiguity Type explains variance in relatedness above and beyond that already explained by cosine distance and same sense. In particular, different sense homonyms appear to be judged as less related, on average, than different sense polysemes (which span a wider range).

The descriptive statistics above also suggest that the different sense Unsure items behave more like same sense items from the homonymy/polysemy stimuli. For this reason, we exclude them from future analyses (and from the relatedness dataset).

nrow(df_merged)
## [1] 8855
df_merged = df_merged %>%
  filter(ambiguity_type != "Unsure")
nrow(df_merged)
## [1] 8624
df_merged %>%
  ggplot(aes(x = relatedness)) +
  geom_histogram(bins = 5) +
  theme_minimal() +
  facet_wrap(~same + ambiguity_type)

df_merged %>%
  ggplot(aes(x = relatedness)) +
  geom_histogram(bins = 5,
                 aes(y = ..density..)) +
  theme_minimal() +
  facet_wrap(~same + ambiguity_type,
             ncol = 2)

df_merged %>%
  ggplot(aes(x = relatedness,
             color = same)) +
  geom_freqpoly(bins = 5) +
  theme_minimal() +
  facet_wrap(~ambiguity_type, ncol = 1)

df_merged %>%
  ggplot(aes(x = relatedness,
             color = same)) +
  geom_freqpoly(bins = 5, 
                 aes(y = ..density..)) +
  theme_minimal() +
  facet_wrap(~ambiguity_type, ncol = 1)

Additional visualizations: residuals from NLMs

Here, we visualize the residuals of a model with cosine distance information from both BERT and ELMo, and ask how those residuals relate to ambiguity_type and same sense. This illustrates the variance that these NLMs do not explain, which is nonetheless correlated with Homonymy/Polysemy and Same/Different Sense.

In particular, this visualization suggests:

  • cosine distance from BERT/ELMo systematically underestimates how similar participants find same sense items to be.
  • for homonyms, cosine distance from BERT/ELMo systematically underestimates how different participants find different sense items to be.

model_both_nlms = lmer(data = df_merged, 
                  relatedness ~ distance_elmo + distance_bert +
                    Class +
                    (1| subject) +
                    (1 | word),
                  control=lmerControl(optimizer="bobyqa"),
                  REML = FALSE)

df_merged$resid_nlm = residuals(model_both_nlms)

df_merged %>%
  ggplot(aes(x = resid_nlm,
             y = ambiguity_type,
             fill = same)) +
  geom_density_ridges2(aes(height = ..density..), 
                       color=gray(0.25), 
                       alpha = 0.5, 
                       scale=0.85, 
                       size=.9, 
                       stat="density") +
  labs(x = "Residuals (rel ~ ELMo + BERT)",
       y = "Ambiguity type") +
  geom_vline(xintercept = 0, linetype = "dotted") +
  theme_minimal()

Creating and evaluating relatedness dataset

Finally, we create the contextualized word relatedness dataset.

Get average and SD relatedness for each pair (including version)

First, we compute the mean and median relatedness judgment for each sentence pair.

df_norms_final = df_merged %>%
  group_by(word, same, ambiguity_type, version, Class) %>%
  summarise(mean_relatedness = mean(relatedness),
            median_relatedness = median(relatedness),
            diff = abs(mean_relatedness - median_relatedness),
            count = n(),
            sd_relatedness = sd(relatedness),
            distance_bert = mean(distance_bert),
            distance_elmo = mean(distance_elmo),
            se_relatedness = sd_relatedness / sqrt(n()))
## `summarise()` regrouping output by 'word', 'same', 'ambiguity_type', 'version' (override with `.groups` argument)
summary(df_norms_final$count)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    4.00   10.00   13.00   12.83   15.00   23.00
nrow(df_norms_final)
## [1] 672
table(df_norms_final$ambiguity_type)
## 
## Homonymy Polysemy 
##      228      444

After removing the "Unsure" items, there are 672 sentence pairs total. The minimum number of observations for any given pair is 4, and the median number of observations is 13.

We can get a sense for how the items distribute by creating a density plot:

df_norms_final %>%
  ggplot(aes(x = mean_relatedness,
             y = ambiguity_type,
             fill = same)) +
  geom_density_ridges2(aes(height = ..density..), 
                       color=gray(0.25), 
                       alpha = 0.5, 
                       scale=0.85, 
                       size=.9, 
                       stat="density") +
  labs(x = "Mean relatedness judgment",
       y = "Ambiguity type") +
  theme_minimal() +
  theme(axis.title = element_text(size=rel(2)),
        axis.text = element_text(size = rel(2)),
        legend.text = element_text(size = rel(2)),
        legend.title = element_text(size = rel(2)))

ggsave("../../Figures/mean_norms.pdf", dpi = 300)
## Saving 7 x 5 in image

We then save these norms to disk.

write.csv(df_norms_final, "../../data/stims/item_means.csv")

Evaluating against BERT and ELMo

Finally, we ask how well each of the cosine distance measures correlates with the mean relatedness judgments. In each case, we compute Spearman's rho.

BERT

cor.test(df_norms_final$distance_bert,
         df_norms_final$mean_relatedness,
         method = "spearman")
## Warning in cor.test.default(df_norms_final$distance_bert,
## df_norms_final$mean_relatedness, : Cannot compute exact p-value with ties
## 
##  Spearman's rank correlation rho
## 
## data:  df_norms_final$distance_bert and df_norms_final$mean_relatedness
## S = 79832165, p-value < 2.2e-16
## alternative hypothesis: true rho is not equal to 0
## sample estimates:
##       rho 
## -0.578419

ELMo

cor.test(df_norms_final$distance_elmo,
         df_norms_final$mean_relatedness,
         method = "spearman")
## Warning in cor.test.default(df_norms_final$distance_elmo,
## df_norms_final$mean_relatedness, : Cannot compute exact p-value with ties
## 
##  Spearman's rank correlation rho
## 
## data:  df_norms_final$distance_elmo and df_norms_final$mean_relatedness
## S = 77339540, p-value < 2.2e-16
## alternative hypothesis: true rho is not equal to 0
## sample estimates:
##        rho 
## -0.5291355

Residual variance

model_nlm = lm(data = df_norms_final,
               mean_relatedness ~ distance_elmo + distance_bert)

summary(model_nlm)$r.squared
## [1] 0.365837
df_norms_final$resid = residuals(model_nlm)


df_norms_final %>%
  ggplot(aes(x = resid,
             y = ambiguity_type,
             fill = same)) +
  geom_density_ridges2(aes(height = ..density..), 
                       color=gray(0.25), 
                       alpha = 0.5, 
                       scale=0.85, 
                       size=.9, 
                       stat="density") +
  labs(x = "Residuals (relatedness ~ ELMo + BERT)",
       y = "Ambiguity type") +
  geom_vline(xintercept = 0, linetype = "dotted") +
  theme_minimal() +
  theme(axis.title = element_text(size=rel(2)),
        axis.text = element_text(size = rel(2)),
        legend.text = element_text(size = rel(2)),
        legend.title = element_text(size = rel(2)))

ggsave("../../Figures/residuals.pdf", dpi = 300)
## Saving 7 x 5 in image
df_norms_final %>%
  ggplot(aes(x = resid)) +
  geom_histogram(bins = 15,
                 aes(y = ..density..),
                 alpha = .5) +
  geom_density() +
  labs(x = "Residuals (relatedness ~ ELMo + BERT)") +
  geom_vline(xintercept = 0, linetype = "dotted") +
  theme_minimal() +
  facet_wrap(~ambiguity_type + same) +
  theme(axis.title = element_text(size=rel(2)),
        axis.text = element_text(size = rel(2)),
        legend.text = element_text(size = rel(2)),
        legend.title = element_text(size = rel(2)),
        strip.text.x = element_text(size = rel(2)))

ggsave("../../Figures/residuals_hist.pdf", dpi = 300)
## Saving 7 x 5 in image
model_categories = lm(data = df_norms_final,
               mean_relatedness ~ same * ambiguity_type)

summary(model_categories)$r.squared
## [1] 0.6596079
model_all = lm(data = df_norms_final,
               mean_relatedness ~ same * ambiguity_type + distance_bert + distance_elmo)

summary(model_all)$r.squared
## [1] 0.7105156

Conclusion

Overall, this norming study indicates that each of the variables of interest explains variance in relatedness judgments.

This is interesting from a theoretical perspective, as it suggests that the cognitive resources or semantic representations that participants call forth to make relatedness judgments involve each of these constructs: the contextual distance between two usages, whether or not those usages cross a sense boundary, and the type of ambiguity at play. Of course, the task invoked, and even made explicit reference to, the latter two contrasts: participants could indicate whether the two usages were the "same meaning" or "totally unrelated". In future work, we will use these stimuli in a primed sensibility judgment task, and ask whether, and how, more implicit measures of processing (e.g., Accuracy and RT) are predicted by these variables.

Hopefully, these final relatedness norms are also useful from an applied perspective. This dataset could be useful for more explicit identification of where neural language models fall short; for example, a regression model trained to predict relatedness from cosine distance does fairly well overall, but appears to underestimate how related participants judge same sense usages to be, and overestimate how related participants judge different sense usages to be (particularly for homonymous wordforms). Thus, while these models clearly track some differences in contexts of use that correspond to human conceptions of semantic relatedness, humans might further group these contexts of use into fuzzy categories, or senses, in a way that BERT/ELMo do not.